Goal
Using the data collected from existing customers, build a model that will help the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio.
Resources Available
The historical data for this project is available in file
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Deliverable – 1 (Exploratory data quality report reflecting the following)
1. Univariate analysis – data types and description of the independent attributes, including name, meaning, range of values observed, central values (mean and median), standard deviation, quartiles, analysis of the body and tails of the distributions, missing values and outliers.
2. Multivariate analysis – bi-variate analysis between the predictor variables and the target column. Comment on your findings in terms of their relationship and degree of relation, if any, and on the presence of leverage points. Visualize the analysis using boxplots and pair plots, histograms or density curves. Select the most appropriate attributes.
3. Strategies to address the different data challenges, such as data pollution, outliers and missing values.
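As one concrete strategy for the outlier challenge, extreme values of a numeric attribute can be capped at the Tukey fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR) rather than dropped. A minimal sketch; the function name and toy series are illustrative, not part of the brief:

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

s = pd.Series([1, 2, 3, 4, 100])   # 100 is an obvious outlier
capped = cap_outliers_iqr(s)
print(capped.tolist())             # the outlier is pulled down to the upper fence (here 7.0)
```

Missing values can be handled analogously with `fillna` (median for numeric columns, mode or an explicit 'unknown' category for categorical ones).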
Deliverable – 2 (Prepare the data for analytics)
1. Load the data into a data-frame. The data-frame should have the data and column descriptions.
2. Ensure the attribute types are correct. If not, take appropriate action.
3. Transform the data, i.e. scale / normalize if required.
4. Create the training set and test set in a ratio of 70:30.
Deliverable – 3 (Create the ensemble model)
1. Write Python code using scikit-learn, pandas, NumPy and others in a Jupyter notebook to train and test the ensemble model.
2. First create a model using a standard classification algorithm and note the model performance.
3. Use appropriate algorithms, and explain the choice of algorithm in the comment lines.
4. Evaluate the model. Use a confusion matrix to evaluate class-level metrics, i.e. precision and recall. Also report the overall score of the model.
5. Discuss the advantages and disadvantages of the algorithm.
6. Build the ensemble models and compare the results with the base model. Note: random forest can be used only with decision trees.
Deliverable – 4 (Tuning the model)
1. Discuss some of the key hyper-parameters available for the selected algorithm. What values did you initialize these parameters to?
2. Regularization techniques used for the model.
3. A range estimate, at 95% confidence, for the model performance in production.
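For item 3, a simple way to turn a hold-out accuracy into a 95% range estimate is the normal approximation to the binomial: acc ± 1.96·sqrt(acc·(1 − acc)/n). A minimal sketch; the 0.90 accuracy and the test-set size are placeholder values, not results:

```python
import math

def accuracy_ci(acc: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation confidence interval for accuracy measured on n test rows."""
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

# e.g. 90% accuracy on a 30% hold-out of a ~45k-row dataset
lo, hi = accuracy_ci(0.90, 13564)
print(f"95% CI for production accuracy: [{lo:.4f}, {hi:.4f}]")
```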
Attribute information
Input variables:
# bank client data:
1 - age (numeric)
2 - job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target): 21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
import pandas as pd
from sklearn.linear_model import LogisticRegression
# importing ploting libraries
import matplotlib.pyplot as plt
#importing seaborn for statistical plots
import seaborn as sns
#Let us break the X and y dataframes into a training set and a test set. For this we will use
#the sklearn package's data splitting function, which is based on random sampling
from sklearn.model_selection import train_test_split
# This is used for numerical python
import numpy as np
# calculate accuracy measures and confusion matrix
from sklearn import metrics
# Reading the csv file and making a dataframe out of it using pandas
bank_full_df = pd.read_csv("bank-full.csv")
# Inspecting the first few rows and the data types in the dataframe
bank_full_df.head().transpose()
#size
bank_full_df.shape
#Info
bank_full_df.info() # There are some columns with datatype object; these need to be converted to numeric
#Describe
bank_full_df.describe().transpose() # Used for statistical analysis to spot potential outliers
#Checking for null value
bank_full_df.isnull().sum() # Based on the observation no null values are present
#Removing columns which are irrelevant and/or unduly influence the target column
bank_full_df = bank_full_df.drop(['day'],axis=1) # removing the day column
bank_full_df.Target.replace(('yes','no'),(1,0),inplace=True) # converting the target column to binary 0 and 1
# Note: in the target column, 0 stands for the 'no' class, i.e. clients who did not subscribe to a term deposit, and 1 for the 'yes' class, i.e. clients who subscribed.
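Before modelling, it is worth quantifying how imbalanced the target is. A minimal sketch, using a hypothetical miniature series in place of the real Target column:

```python
import pandas as pd

# Hypothetical stand-in for the Target column after the yes/no -> 1/0 mapping;
# with the real dataframe, use bank_full_df['Target'] instead.
target = pd.Series([0] * 88 + [1] * 12)

dist = target.value_counts(normalize=True)
print(dist)  # class 0 dominates; plain accuracy will therefore look deceptively high
```

A heavily skewed distribution like this is why the class-level recall figures reported later matter more than the overall score.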
# scikit-learn models require numeric inputs; they cannot take string / object columns directly.
# The following code loops through each column and, if the column type is object, converts it
# into a categorical type, with each distinct value replaced by an integer code.
for feature in bank_full_df.columns: # Loop through all columns in the dataframe
    if bank_full_df[feature].dtype == 'object': # Only apply to columns with categorical strings
        bank_full_df[feature] = pd.Categorical(bank_full_df[feature]).codes # Replace strings with integer codes
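Integer codes like those produced above impose an artificial ordering on nominal categories ('divorced' < 'married' < 'single'). For distance- or coefficient-based models, one-hot encoding can be safer. A minimal sketch on a hypothetical column:

```python
import pandas as pd

df_demo = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

codes = pd.Categorical(df_demo["marital"]).codes               # integer codes, in sorted-category order
onehot = pd.get_dummies(df_demo["marital"], prefix="marital")  # one indicator column per category

print(codes.tolist())   # [1, 2, 0, 2]
print(onehot.shape)     # (4, 3)
```

For tree-based models, which only compare thresholds, the integer codes used above are usually acceptable.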
sns.distplot(bank_full_df['duration'])
sns.distplot(bank_full_df['age'])
sns.distplot(bank_full_df['balance'])
#The histograms above show that the data is skewed, with long right tails (positive skew). This skewness means we need to sample the data carefully when splitting it into train_data and test_data!
fig = plt.figure(1, figsize=(9, 6))
ax1 = fig.add_subplot(211)
bp1 = ax1.boxplot(bank_full_df.balance,0,'')
ax2 = fig.add_subplot(212)
bp2 = ax2.boxplot(bank_full_df.balance,0,'gD')
plt.show()
fig = plt.figure(1, figsize=(6, 6))
ax = fig.add_subplot(211)
bp = ax.boxplot(bank_full_df.age,0,'')
ax = fig.add_subplot(212)
bp = ax.boxplot(bank_full_df.age,0,'gD')
plt.show()
fig = plt.figure(1, figsize=(9, 6))
ax1 = fig.add_subplot(211)
bp1 = ax1.boxplot(bank_full_df.duration,0,'')
ax2 = fig.add_subplot(212)
bp2 = ax2.boxplot(bank_full_df.duration,0,'gD')
plt.show()
#The boxplots above show how the data is spread. Many points lie more than 1.5 times the IQR above the third quartile, i.e. by the conventional rule they are outliers.
data = bank_full_df
print(data.columns)
data.head() # Observing the dataframe created by converting the categorical data to numeric
data.shape
data.info() # It is clear that all the columns in the dataframe are now numeric
data.describe().transpose()
sns.pairplot(data) # Examining how the data is distributed and the impact of the attributes (independent variables) on the classes (target variable) and on each other
plt.figure(figsize=(10,8))
sns.heatmap(data.corr(),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="YlGnBu")
plt.show()
#Breaking the dataset into two parts, X denotes the independent variables or features, y denotes the target variable
X = data.drop("Target", axis=1)
y = data.pop("Target")
print(X.shape , y.shape)
test_size = 0.30 # taking 70:30 training and test set
iterationList=np.random.randint(1,100,10) # a list of 10 random numbers between 1 and 100, used as random states in the various iterations of the different models
from sklearn import model_selection
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    #Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy = model.score(X_test, y_test)
    expected = y_test
    print("Iteration ", itr)
    itr = itr + 1
    print()
    print("data split random state ", seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ", accuracy)
    print("----------------------------------------------------")
# From the Naive Bayes study, the highest overall score is 85%; however, the recall for the subscriber ('yes') class is only 46%, showing the lack of data for subscribers.
# Importing SVC
from sklearn.svm import SVC
# Building a Support Vector Machine on the train data with kernel = 'rbf'
svc_model = SVC(C= .1, kernel='rbf', gamma= 1)
# gamma is a measure of the influence of a data point; it is the inverse of the distance of influence. C controls model complexity:
# a lower C value creates a simpler hyper-surface, while a higher C creates a more complex surface.
seed = 1 # Random number seeding for repeatability of the code
test_size = 0.30 # taking 70:30 training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,random_state=seed)
# Fitting the model
svc_model.fit(X_train, y_train)
#Prediction on test set
prediction = svc_model.predict(X_test)
# Accuracy on test set
accuracy = svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
# Unfortunately the SVM with the rbf kernel took a long time to execute and we do not get much from this model: it only predicts the non-subscriber class.
# importing necessary libraries
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import zscore
# Convert the features into z-scores, as we do not know what units / scales were used, and store them in a new dataframe.
# It is always advised to scale numeric attributes in models that calculate distances.
df_z = data.apply(zscore) # converting all attributes to Z scale
df_z.describe().transpose()
X_z = df_z # Fetching all features/independent columns from the z-score dataframe df_z
itr=1
# choosing k value as 3 and assigning weight values based on the distance
NNH = KNeighborsClassifier(n_neighbors= 3 , weights = 'distance' )
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = train_test_split(X_z, y, test_size=test_size, random_state=seed)
    # Fitting the model
    NNH.fit(X_train, y_train)
    #Prediction on test set
    prediction = NNH.predict(X_test)
    # Accuracy on test set
    accuracy = NNH.score(X_test, y_test)
    expected = y_test
    print("Iteration ", itr)
    itr = itr + 1
    print()
    print("data split random state ", seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ", accuracy)
    print("----------------------------------------------------")
itr=1
# choosing k value as 7 and assigning weight values based on the distance
NNH = KNeighborsClassifier(n_neighbors= 7 , weights = 'distance' )
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = train_test_split(X_z, y, test_size=test_size, random_state=seed)
    # Fitting the model
    NNH.fit(X_train, y_train)
    #Prediction on test set
    prediction = NNH.predict(X_test)
    # Accuracy on test set
    accuracy = NNH.score(X_test, y_test)
    expected = y_test
    print("Iteration ", itr)
    itr = itr + 1
    print()
    print("data split random state ", seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ", accuracy)
    print("----------------------------------------------------")
# From the KNN analyses with n_neighbors 3 and 7, we get relatively better results with n_neighbors = 3. The best result, obtained in iteration 2, has an overall accuracy of 88%, but again the recall for subscribers is low due to lack of data.
from sklearn.tree import DecisionTreeClassifier
# Note: decision trees in scikit-learn do not take strings as input for the model fit step; the columns were already encoded above.
test_size = 0.30 # taking 70:30 training and test set
iterationList = np.random.randint(1,100,5) # Creating a list of 5 random integers between 1 and 100, used as random states in the different iterations
itr = 1
for i in iterationList:
    treeseed = i
    dt_model = DecisionTreeClassifier(criterion='gini', random_state=treeseed)
    for j in iterationList:
        seed = j
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
        dt_model.fit(X_train, y_train)
        #Prediction on test set
        prediction = dt_model.predict(X_test)
        # Accuracy on test set
        accuracy = dt_model.score(X_test, y_test)
        expected = y_test
        print("Iteration ", itr)
        itr = itr + 1
        print()
        print("Decision tree criterion gini random state ", treeseed)
        print("data split random state ", seed)
        print("Classification report")
        print(metrics.classification_report(expected, prediction))
        print("Confusion matrix")
        print(metrics.confusion_matrix(expected, prediction))
        print("Overall score ", accuracy)
        print("----------------------------------------------------")
itr = 1
for i in iterationList:
    treeseed = i
    dt_model = DecisionTreeClassifier(criterion='entropy', random_state=treeseed)
    for j in iterationList:
        seed = j
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
        dt_model.fit(X_train, y_train)
        #Prediction on test set
        prediction = dt_model.predict(X_test)
        # Accuracy on test set
        accuracy = dt_model.score(X_test, y_test)
        expected = y_test
        print("Iteration ", itr)
        itr = itr + 1
        print()
        print("Decision tree criterion entropy random state ", treeseed)
        print("data split random state ", seed)
        print("Classification report")
        print(metrics.classification_report(expected, prediction))
        print("Confusion matrix")
        print(metrics.confusion_matrix(expected, prediction))
        print("Overall score ", accuracy)
        print("----------------------------------------------------")
# The results obtained with gini and entropy are almost identical: the highest overall accuracy is 88%, but again the recall for the subscriber class is low, at 49%, due to the scarcity of subscriber data.
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus
train_char_label = ['No', 'Yes']
xvar = data
feature_cols = xvar.columns
dot_data = StringIO()
export_graphviz(dt_model, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=list(train_char_label))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('bank_full_tree.png')
Image(graph.create_png())
md = 10 # Initial max depth; increased by 1 on each outer iteration, so trees with a maximum depth of 10 to 14 are tried
itr=1
for i in iterationList:
    treeseed = i
    clf_pruned = DecisionTreeClassifier(criterion='entropy', max_depth=md, random_state=treeseed)
    for j in iterationList:
        seed = j
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
        clf_pruned.fit(X_train, y_train)
        #Prediction on test set
        prediction = clf_pruned.predict(X_test)
        # Accuracy on test set
        accuracy = clf_pruned.score(X_test, y_test)
        expected = y_test
        print("Iteration ", itr)
        itr = itr + 1
        print()
        print("Regularised Decision tree criterion entropy random state ", treeseed)
        print("Max depth ", clf_pruned.max_depth)  # report the depth actually used by this tree
        print("data split random state ", seed)
        print("Classification report")
        print(metrics.classification_report(expected, prediction))
        print("Confusion matrix")
        print(metrics.confusion_matrix(expected, prediction))
        print("Overall score ", accuracy)
        print("----------------------------------------------------")
    md = md + 1  # try a deeper tree on the next outer iteration
# Here we get slightly better results than with the fully grown decision tree, so we conclude that the full tree was overfitted.
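Rather than hand-stepping max_depth as above, the hyper-parameter search asked for in Deliverable 4 can be automated with scikit-learn's GridSearchCV. A minimal sketch on synthetic stand-in data (with the bank dataframe, pass X and y instead of the demo arrays):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; shapes and values are illustrative only
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

param_grid = {"max_depth": [3, 5, 7, 10, None], "criterion": ["gini", "entropy"]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                    cv=5, scoring="recall")  # recall matters most for the minority class
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```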
dot_data = StringIO()
export_graphviz(clf_pruned, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=list(train_char_label))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('bank_regularized_tree.png')
Image(graph.create_png())
from sklearn.ensemble import BaggingClassifier
itr=1
for i in iterationList:
    treeseed = i
    bgcl = BaggingClassifier(base_estimator=dt_model, n_estimators=100, random_state=treeseed)
    for j in iterationList:
        seed = j
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
        bgcl = bgcl.fit(X_train, y_train)
        #Prediction on test set
        prediction = bgcl.predict(X_test)
        # Accuracy on test set
        accuracy = bgcl.score(X_test, y_test)
        expected = y_test
        print("Iteration ", itr)
        itr = itr + 1
        print()
        print("Bagging random state ", treeseed)
        print("data split random state ", seed)
        print("Classification report")
        print(metrics.classification_report(expected, prediction))
        print("Confusion matrix")
        print(metrics.confusion_matrix(expected, prediction))
        print("Overall score ", accuracy)
        print("----------------------------------------------------")
# With bagging we get a highest accuracy of 90%, with a recall for subscribers of 47%, again showing the dominance of the non-subscriber class.
from sklearn.ensemble import AdaBoostClassifier
itr=1
for i in iterationList:
    treeseed = i
    abcl = AdaBoostClassifier(base_estimator=dt_model, n_estimators=100, random_state=treeseed)
    for j in iterationList:
        seed = j
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
        abcl = abcl.fit(X_train, y_train)
        #Prediction on test set
        prediction = abcl.predict(X_test)
        # Accuracy on test set
        accuracy = abcl.score(X_test, y_test)
        expected = y_test
        print("Iteration ", itr)
        itr = itr + 1
        print()
        print("AdaBoosting random state ", treeseed)
        print("data split random state ", seed)
        print("Classification report")
        print(metrics.classification_report(expected, prediction))
        print("Confusion matrix")
        print(metrics.confusion_matrix(expected, prediction))
        print("Overall score ", accuracy)
        print("----------------------------------------------------")
from sklearn.ensemble import GradientBoostingClassifier
itr=1
for i in iterationList:
    treeseed = i
    gbcl = GradientBoostingClassifier(n_estimators=100, random_state=treeseed)
    for j in iterationList:
        seed = j
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
        gbcl = gbcl.fit(X_train, y_train)
        #Prediction on test set
        prediction = gbcl.predict(X_test)
        # Accuracy on test set
        accuracy = gbcl.score(X_test, y_test)
        expected = y_test
        print("Iteration ", itr)
        itr = itr + 1
        print()
        print("GradientBoosting random state ", treeseed)
        print("data split random state ", seed)
        print("Classification report")
        print(metrics.classification_report(expected, prediction))
        print("Confusion matrix")
        print(metrics.confusion_matrix(expected, prediction))
        print("Overall score ", accuracy)
        print("----------------------------------------------------")
# In this case AdaBoost performs better than Gradient Boosting: although the highest overall accuracy with Gradient Boosting is greater, AdaBoost gives more accurate predictions when it comes to recall for the subscriber class.
from sklearn.ensemble import RandomForestClassifier
itr=1
for i in iterationList:
    treeseed = i
    rfcl = RandomForestClassifier(n_estimators=100, random_state=treeseed)
    for j in iterationList:
        seed = j
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
        rfcl = rfcl.fit(X_train, y_train)
        #Prediction on test set
        prediction = rfcl.predict(X_test)
        # Accuracy on test set
        accuracy = rfcl.score(X_test, y_test)
        expected = y_test
        print("Iteration ", itr)
        itr = itr + 1
        print()
        print("Random Forest random state ", treeseed)
        print("data split random state ", seed)
        print("Classification report")
        print(metrics.classification_report(expected, prediction))
        print("Confusion matrix")
        print(metrics.confusion_matrix(expected, prediction))
        print("Overall score ", accuracy)
        print("----------------------------------------------------")
# Random forest gives almost identical results across all iterations, with an overall score of 90% and a recall for subscribers of 41%. This again shows that non-subscribers vastly outnumber subscribers, and we need to increase the dataset for the subscriber class.
All the models used to analyse this bank-full.csv dataset give an overall score between 88% and 90%, with a recall for the subscriber class (clients who subscribed to a term deposit) of 40-49%. This clearly shows that the non-subscriber class dominates, so to get a better model we need to collect more data for the subscriber class.
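One way to act on this conclusion without collecting new data is to re-weight the minority class during training, e.g. with class_weight='balanced'. A minimal sketch on synthetic imbalanced data; the demo arrays stand in for the bank features, and results on the real data will differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 12% positives, mimicking the subscriber share
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.88], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=7, stratify=y_demo)

plain = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_tr, y_tr)
weighted = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                  random_state=7).fit(X_tr, y_tr)

print("minority recall, plain:   ", recall_score(y_te, plain.predict(X_te)))
print("minority recall, balanced:", recall_score(y_te, weighted.predict(X_te)))
```

Oversampling the minority class (e.g. SMOTE from the imbalanced-learn package) is another common option.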